On the Convergence of Stochastic Iterative Dynamic Programming Algorithms

Authors

  • Tommi S. Jaakkola
  • Michael I. Jordan
  • Satinder P. Singh
Abstract

Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(λ) algorithm of Sutton and the Q-learning algorithm of Watkins, can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.

An important component of many real-world learning problems is the temporal credit assignment problem: the problem of assigning credit or blame to individual components of a temporally extended plan of action, based on the success or failure of the plan as a whole. To solve such a problem, the learner must be equipped with the ability to assess the long-term consequences of particular choices of action, and must be willing to forego an immediate payoff for the prospect of a longer-term gain. Moreover, because most real-world problems involving prediction of the future consequences of actions involve substantial uncertainty, the learner must be prepared to make use of a probability calculus for assessing and comparing actions.

There has been increasing interest in the temporal credit assignment problem, due principally to the development of learning algorithms based on the theory of dynamic programming (DP) (Barto, Sutton, & Watkins; Werbos). Sutton's TD(λ) algorithm addressed the problem of learning to predict in a Markov environment, utilizing a temporal difference operator to update the predictions. Watkins' Q-learning algorithm extended Sutton's work to control problems and also clarified the ties to dynamic programming.

In the current paper our concern is with the stochastic convergence of DP-based learning algorithms. Although Watkins, and Watkins and Dayan, proved that Q-learning converges with probability one, and Dayan observed that TD(0) is a special case of Q-learning and therefore also converges with probability one, these proofs rely on a construction that is particular to Q-learning and fail to reveal the ties of Q-learning to the broad theory of stochastic approximation (e.g., Wasan). Our goal here is to provide a simpler proof of convergence for Q-learning by making direct use of stochastic approximation theory. We also show that our proof extends to TD(λ) for arbitrary λ. Several other authors have recently presented results that are similar to those presented here: Dayan and Sejnowski for TD(λ), Peng and Williams for TD(λ), and Tsitsiklis for Q-learning. Our results appear to be closest to those of Tsitsiklis.

We begin with a general overview of Markov decision problems and DP. We introduce the Q-learning algorithm as a stochastic form of DP. We then present a proof of convergence for a general class of stochastic processes of which Q-learning is a special case. Finally, we discuss TD(λ) and show that it is also a special case of our theorem.
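For concreteness, the tabular Q-learning update whose convergence is at issue can be sketched as follows. This is the standard cost-minimizing form of the update; the symbols and step-size conditions are stated in the generic terms of the stochastic approximation literature rather than quoted from the paper's theorem:

\[
Q_{t+1}(s_t, u_t) = (1 - \alpha_t)\, Q_t(s_t, u_t) + \alpha_t \Bigl[ c_{s_t}(u_t) + \gamma \min_{v \in U(s_{t+1})} Q_t(s_{t+1}, v) \Bigr]
\]

Here s_t, u_t, and c_{s_t}(u_t) are the state visited, action taken, and cost incurred at step t; γ ∈ [0, 1) is a discount factor; U(j) is the set of actions available in state j (these objects are defined formally in the Markov decision problem section that follows); and the step sizes α_t are assumed to satisfy the usual Robbins-Monro conditions, Σ α_t = ∞ and Σ α_t² < ∞, separately for each state-action pair.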
Markov decision problems

A useful mathematical model of temporal credit assignment problems, studied in stochastic control theory (Aoki) and operations research (Ross), is the Markov decision problem. Markov decision problems are built on the formalism of controlled Markov chains. Let S be a discrete state space with N states, and let U(i) be the discrete set of actions available to the learner when the chain is in state i. The probability of making a transition from state i to state j is given by p_ij(u), where u ∈ U(i). The learner defines a policy π, which is a function from states to actions. Associated with every policy π is a Markov chain defined by the state transition probabilities p_ij(π(i)). There is an instantaneous cost c_i(u) associated with each state i and action u, where c_i(u) is a random variable with expected value c̄_i(u). We also define a value function V^π(i), which is the expected sum of discounted future costs given that the system begins in state i and follows policy π.
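As a minimal concrete sketch of the preceding definitions and of Q-learning as a stochastic form of DP, the following Python fragment runs tabular Q-learning on a small synthetic Markov decision problem. The toy transition probabilities, noisy costs, uniform exploration policy, and 1/n step sizes are illustrative assumptions, not details taken from the paper.

import numpy as np

# Minimal tabular Q-learning sketch for a small synthetic Markov decision
# problem, in the notation above: states i, actions u in U(i), transition
# probabilities p_ij(u), random instantaneous costs with mean c_i(u), and a
# discount factor gamma. Everything below is an illustrative toy.

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# p[u, i, :] is the transition distribution p_ij(u) out of state i under action u.
p = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
# c_mean[i, u] is the expected instantaneous cost of taking action u in state i.
c_mean = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))   # per-pair counts for the step sizes

state = 0
for t in range(200_000):
    action = rng.integers(n_actions)                       # uniform exploration
    cost = c_mean[state, action] + rng.normal(0.0, 0.1)    # noisy cost sample
    next_state = rng.choice(n_states, p=p[action, state])  # sample j ~ p_ij(u)

    visits[state, action] += 1
    alpha = 1.0 / visits[state, action]   # sum alpha = inf, sum alpha^2 < inf

    # Q-learning update: move Q(i, u) toward cost + gamma * min_v Q(j, v).
    target = cost + gamma * Q[next_state].min()
    Q[state, action] += alpha * (target - Q[state, action])
    state = next_state

print("Estimated optimal values V*(i) = min_u Q(i, u):", Q.min(axis=1))

With uniform exploration every state-action pair is visited infinitely often, which is the kind of condition under which convergence results of the sort discussed above apply.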


Similar resources

Robust inter and intra-cell layouts design model dealing with stochastic dynamic problems

In this paper, a novel quadratic assignment-based mathematical model is developed for concurrent design of robust inter and intra-cell layouts in dynamic stochastic environments of manufacturing systems. In the proposed model, in addition to considering time value of money, the product demands are presumed to be dependent normally distributed random variables with known expectation, variance, a...


Convergence of Stochastic Iterative Dynamic Programming Algorithms

Increasing attention has recently been paid to algorithms based on dynamic programming (DP) due to the suitability of DP for learning problems involving control. In stochastic environments where the system being controlled is only incompletely known, however, a unifying theoretical account of these methods has been missing. In this paper we relate DP-based learning algorithms to the powerful te...


Learning Algorithms for Risk-Sensitive Control

This is a survey of some reinforcement learning algorithms for risk-sensitive control on infinite horizon. Basics of the risk-sensitive control problem are recalled, notably the corresponding dynamic programming equation and the value and policy iteration methods for its solution. Basics of stochastic approximation algorithms are also sketched, in particular the ‘o.d.e.’ approach for its stabil...


Stochastic Dynamic Programming with Markov Chains for Optimal Sustainable Control of the Forest Sector with Continuous Cover Forestry

We present a stochastic dynamic programming approach with Markov chains for optimal control of the forest sector. The forest is managed via continuous cover forestry and the complete system is sustainable. Forest industry production, logistic solutions and harvest levels are optimized based on the sequentially revealed states of the markets. Adaptive full system optimization is necessary for co...


Modelling and Decision-making on Deteriorating Production Systems using Stochastic Dynamic Programming Approach

This study aimed at presenting a method for formulating optimal production, repair and replacement policies. The system was based on the production rate of defective parts and machine repairs and then was set up to optimize maintenance activities and related costs. The machine is either repaired or replaced. The machine is changed completely in the replacement process, but the productio...


On new faster fixed point iterative schemes for contraction operators and comparison of their rate of convergence in convex metric spaces

In this paper we present new iterative algorithms in convex metric spaces. We show that these iterative schemes are convergent to the fixed point of a single-valued contraction operator. Then we make the comparison of their rate of convergence. Additionally, numerical examples for these iteration processes are given.



Journal:
  • Neural Computation

Volume 6, Issue -

Pages -

Publication year: 1994